Network Transfer Entropy and Metric Space for Causality Inference 
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A measure is derived to quantify directed information transfer between pairs of vertices in a 
weighted network, over paths of a specified maximal length. Our approach employs a general, 
probabilistic model of network traffic, from which the informational distance between dynamics 
on two weighted networks can be naturally expressed as a Jensen Shannon Divergence (JSD). Our 
network transfer entropy measure is shown to be able to distinguish and quantify causal relationships 
between network elements, in applications to simple synthetic networks and a biological signalling 
network. We conclude with a theoretical extension of our framework, in which the square root of 
the JSD induces a metric on the space of dynamics on weighted networks. We prove a convergence 
criterion, demonstrating that a form of convergence in the structure of weighted networks in a family 
of matrix metric spaces implies convergence of their dynamics with respect to the square root JSD 
metric. 



I. INTRODUCTION 

Complex systems in diverse fields are often repre- 
sented as weighted networks [TH3]. Inferring information 
transfer between network elements from such repre- 
sentations can provide important insights into system 
structure and perturbations @HB]. This paper proposes 
a general methodology to quantify such transfer based 
on information theory and probability on graphs. 
Previous attempts to infer dynamics from weighted 
networks include an interaction model based on electri- 
cal circuitry, to discover active pathways contributing 
to the pathogenesis of the brain cancer Glioblastoma 
multiforme [U [5] . Reference uses a model of infection 
transmission, proportional to interaction frequency, to 
identify the spread of disease through social networks. 
Such case by case approaches have proved informative, 
however they are often tailor made to their applications 
and the general quantification of information transfer 
in weighted networks currently lacks a theoretical 
foundation. 

The purpose of this paper is to construct an information 
theoretic measure, network transfer entropy, quantifying 
the directed amount of information transferred between 
any two vertices in a weighted network, with minimal 
assumptions and general applicability. Following con- 
struction (Section II), we demonstrate the measure on 
simple synthetic networks and a biological signalling 
network (Section III). 

In the construction of general measures, one aims 
at an insight into the theoretical concepts governing 
the process being studied. We demonstrate that the 
network transfer entropy framework can be interpreted 
in the context of metric spaces (Section IV). In this 
construction one defines a family of mappings from the 
space of weighted networks (represented by matrices) 



to a family of metric spaces, whose elements describe 
possible signal dynamics on networks. 
We prove a convergence principle, demonstrating that 
a form of convergence of weighted networks in the 
metric space L p (M NxN ) implies convergence in the 
constructed metric space of signal dynamics. This result 
shows that in our general framework, deformation of 
network structure influences network dynamics in an 
intuitive way. This result has real world implications 
in, for example, network drug design, where one wishes 
to modify the chemical affinities of interacting proteins 
in a pathological signalling network (i.e. modify the 
edge weights) to restore a healthy signalling regime (i.e. 
modify the dynamics). Finally, we motivate how certain 
further theoretical problems in network evolution and 
network perturbation can be approached within this 
framework. 



II. NETWORK TRANSFER ENTROPY 

Transfer entropy was introduced by Schreiber, to quan- 
tify the directed amount of information transferred be- 
tween two mutually dependent time series [7] . This prob- 
lem shares several important qualities with our problem 
of information transfer between network vertices, thus we 
follow Schreiber's approach in the derivation of our mea- 
sure. 

The definition of transfer entropy required a model in 
which it was possible to express whether two time series 
influenced each other. For transfer entropy to be widely 
applicable, this model needed to be sufficiently general 
to portray a wide array of diverse systems. It was thus 
intuitive to describe time series as realisations of (approx- 
imately) Markov processes of order k. For such a process 
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I the conditional probability of finding the process in 
state i n +i at time n + 1 satisfies 

P(in+l\in, ■ ■ ■ ,in-k+l) = KWl|*n> ■ ■ ■ ,in-k)- (1) 

These generalised Markov process are not all encompass- 
ing in their descriptive power; for example, they are in 
general not-applicable to studying subsystems of Markov 
processes [5]. However for a broad range of datasets in- 
cluding heart and breathing rate data [7J, magnetoen- 
cephalography data [9] and financial time series [TO] , 
the approximate Markov process model can be justified, 
making transfer entropy widely applicable. 
The choice of a general dynamic model for a weighted 
network requires consideration of the literature. One 
must be careful to ensure the model makes minimal as- 
sumptions yet has sufficient descriptive power to portray 
complex systems. Much work has focused on interaction 
models known as flow networks (see for example in 
which transport from source nodes to sink nodes is sub- 
ject to edge weight dependant constraints. These models 
are useful in optimisation problems where one wants to 
find paths through a network that maximise or minimise 
some function associated with path traversal, and thus 
tend to be used in systems where traffic can be manipu- 
lated, such as supply management [12]. 
Flow networks are less useful in the interrogation of net- 
work dynamics where constraints on traffic are unknown, 
and sink and source nodes and not readily defined. More- 
over, when network dynamics are stochastic and bursty 
rather than continuous flows, such as in social commu- 
nication systems [T3] and gene regulatory networks [13], 
adaptations of flow networks are required. Such adapta- 
tions include discrete flow networks [16] and stochastic 
flow networks 15J, in which the interaction of a vertex 
with neighbours is given by a probability distribution 
proportional to the edge weight distribution. The ele- 
gance of these discrete models is that they may approx- 
imate continuous models (such as flow networks) in the 
large time limit. Such models are generalised, for exam- 
ple by the inclusion of holding rates, in queuing theory 
[17j which with detailed information for parameter es- 
timation can be used to describe and simulate a large 
variety of real world systems. 

Given this literature we take our dynamic model for 
weighted networks as a balance between the descriptive 
power of stochastic networks of queuing theory and the 
simplicity of the stochastic flow networks. We consider 
the following Markovian model for signal dynamic evolu- 
tion. Each vertex is assigned a data derived value quanti- 
fying a signal that the given vertex is capable of forward- 
ing to one its neighbours. The vector containing these 
values for every vertex is referred to as the initial signal 
distribution (ISD) of the network. In real world applica- 
tions this distribution can be qualitatively diverse. For 
example, in biological networks, where vertices represent 
genes, the ISD may quantify the differential expression of 
genes in pathological versus healthy samples. In the air- 
port transportation network where vertices are airports 



and edges connect airports one can fly between directly, 
the ISD may be the number of flights departing from 
each airport over a given time frame [3]. There are no 
restrictions on the ISD other than it being a vector in 
R N , where N is the number of vertices. 
We evolve this signal over the network W — (iuy)^ =1 
via a stochastic matrix P = (py)ii=i defined as 

Sf otherwise 

where A/$ denotes the set of neighbours of vertex i and 
5f denotes the Kronecker delta of i and j. We evolve the 
ISD over multiple discrete time steps. At each time step 
the signal at each vertex i is independently forwarded to 
vertex j G Mi with probability Pij (we emphasise that 
self-edges can be added to the network and the weights 
on such edges would determine the probability that a ver- 
tex maintains its signal over a single time step). Thus the 
number of time steps directly corresponds to the maxi- 
mal path length the ISD has traversed. 
Given an ISD, Xq, and path length, n, for every vertex i 
in the network we can compute the probability distribu- 
tion of the signal at vertex i given that the ISD has been 
forwarded along paths of length n (see Appendix B) . We 
denote this distribution P[JQ|Ao]. 

Given a model of interactions, we wish to quantify paths 
of high or low traffic through the network. To proceed in 
this aim it suffices to identify the directed amount of in- 
formation transferred between any two pairs of network 
vertices during a period of system evolution. To achieve 
this we again turn to Schreiber's methodology. In the 
derivation of transfer entropy the directed amount of in- 
formation transferred from a process J to a process / is 
formulated as the incorrectness of the assumption that / 
is not conditional on J. This quantity can be expressed 
as the Kullback-Leibler divergence (KLD) [18) 

V n(i n+1 I'M loff P^+^n-k+lJD i o\ 

where i" n — (i n , . . . , i m ) for m < n. Thus to quantify the 
amount of information vertex j in our network transfers 
to vertex i over paths of length n we must derive a dis- 
tribution for X l n in which vertex j sends no information 
to vertex i. We must then compute the informational 
distance between this distribution and the above distri- 
bution P[X^ \Xq] in which j is able to communicate with 
i. Clearly if we set the j th row in the stochastic ma- 
trix P to ej (i.e., make j an absorbing state; ej denotes 
the j th element of the standard basis of R N ), then it is 
impossible for vertex j to communicate with any vertex 
i =/= j under our model. Given this modified matrix we 
can compute the probability distribution of X % n given the 
ISD and that j cannot communicate with i. We denote 
this distribution P[X^\Xo,j]. Here we diverge from [7J, 
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however, as the KLD 

is only well defined provided 

{x : P[X l n = x\X ,j] = 0} C {x : PLY* = z|X ] = 0}, 

which is an assumption that does not hold in general. 
Consider, for example, a directed graph on two vertices 
1 and 2, with a single directed edge oriented from 1 to 
2; if we assign the ISD as X = ei, then it is trivial that 
P[Xl = x\X ] = Si and P[Xf = x\X Q , 1] = S° x . 
Thus to quantify the directed amount of information 
transferred from vertex j to i we must employ a different 
measure of statistical distance. There are several possi- 
ble choices available, among the most promising are the 
Jensen-Shannon divergence (JSD), which is a linear com- 
bination of KLDs and the statistical distance introduced 
by Wootters [IS]. Both measures are theoretically rich; 
Wootter's measure was designed as a distinguishably dis- 
tance between pure quantum states after a finite number 
of observations, and applies equally well to distinguish- 
ing two probability distributions. The measure also has a 
geometric interpretation in the context of Hilbert space. 
The JSD of two distributions, quantifies the total KLD 
from each distribution to the average of the two, and thus 
is a measure of distributional similarity. The JSD is also 
the square of a metric over the space of probability dis- 
tributions on a measurable set [20] ■ These two measures 
have been shown to agree to second order in a quantum 
mechanical framework |21j . 
We will use the JSD defined by 



in these examples, we estimated the probability distribu- 
tions P[X l n \X ] and P[X^\X ,j] for all j, using a simula- 
tion. We also devised a method to estimate the statistical 
error in the probability distributions (see Appendix C). 
The first and most simple example we consider is a di- 
rected path of length 5 with equal edge weights (Figure 
1). This induces the stochastic matrix 



JSD(p,q) = - 




p{x) l0| 



m{x) 



q(x) log 



m(x) _ 



1(4) 



where p, q : X — > [0, 1] are probability distributions (with 
no restrictions placed on their kernels) and m = (p+q)/2. 
We select this measure as the metric interpretation is of 
greater use to our theoretical framework. 
Whence we define the network transfer entropy (NTE) 
from j to i over path length n and given an ISD X a by 



JSDiPiX^XolPiX^XoJ]) 



(5) 



This is the central concept of the paper. Note that 
r 5 (ill*) e [0,log2] is inherently asymmetric, and thus 

-AO 

quantifies information transfer through a network in a 
directed sense, permitting the inference of causality. 



III. EXAMPLES 

In order to demonstrate the use of NTE we consider 
three examples, two synthetic networks and one applica- 
tion to biological signal transduction. To evaluate NTE 
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The structure of this network provides a completely pre- 
dictable path for signal transfer and thus is ideal for prov- 
ing the capability of our measure to detect information 
transfer. We consider two ISDs on this network, firstly 
Xq = ei, where the first vertex in the path is given an 
initial signal and all other vertices have no signal to trans- 
fer. If we number the vertices 1 to 5 from the start of the 
path to the end, then it is clear that for this ISD, over 
path length n = 1, vertex 1 sends information to vertex 
2, and no other vertices communicate, for n — 2 vertex 
1 sends information to vertex 3 and vertex 2 sends infor- 
mation to vertex 3 and no other vertices communicate, 
and similarly we can compute all pairwise information 
transfer events up to n = 4 beyond which all signal is 
absorbed at vertex 5 and cannot be transmitted through 
the network. This pattern is precisely what is seen if we 
calculate the NTE between all vertex pairs over different 
path lengths n (Figure 1). 

We next consider the ISD X = (1, 1, 1, 1, 1) T , on the 
same network, in order to demonstrate the ability of the 
NTE measure to discern between situations where net- 
works with identical edge weights have different starting 
signal states. One would expect that with this ISD, for 
n = 1, rather than just vertex 1 forwarding information 
to vertex 2, we have vertex j forwarding information to 
vertex j + 1 for j = 1, . . . , 4, and similar extensions for 
longer path lengths. The NTE measure can detect these 
differences due to initial signal distribution (Figure 1). 
The next network we consider is a slight extension to the 
deterministic path which constitutes a directed feedback 
from vertex 2 to vertex 4 weighted w = x/(l — x) (Figure 
2). This induces the stochastic matrix 
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The network introduces some indeterminism in that if 
vertex 4 holds signal, it can either forward it to vertex 
5 with probability 1 — x or feedback the signal to vertex 
2 with probability x. This essentially sets up a feedback 
loop which dampens the signal received at node 5 over a 
given path length, by a factor dependant on x. We cal- 
culated the NTE for all vertex pairs in this altered path 
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FIG. 2: Matrices showing NTE between all vertex pairs in the modified, weighted path with feedback from vertex 4 to vertex 
2, over path lengths n = 1 — 5, for a range of feedback strengths. Note that as the weight on the edge (4, 2) rises the NTE from 
vertices 1-4 to 5 falls. 



for x — 0.1,0.5,0.9 and found that as x is increased the 
NTE to vertex 5 from all other vertices falls, as expected 
(Figure 2). Thus in the context of these very simple syn- 
thetic networks, the use of NTE as a tool for detecting 
information transfer is clear. 

We next demonstrate NTE in a real world biological net- 
work. To do this we consider the human primary naive 
CD4+ T cell intracellular signalling network analysed by 
Sachs et al. [55], consisting of 11 vertices (Figure 3). In 
this network vertices are proteins which can be phospho- 
rylated and directed edges connect kinases (capable of 
phosphorylating proteins) with their targets. The kinases 
must be in an active state before they can phosphorylate 



a target; activity can be achieved by either phosphory- 
lation by an upstream kinase or activation by a reagent. 
Sachs et al. generated data accompanying this network 
consisting of quantification (by flow cytometry) of the 
amount of phosphorylated protein for each vertex in the 
network following ten independent perturbations. The 
network perturbations consisted of the administration of 
reagents which could either activate or inhibit the kinase 
activity of particular vertices. 

To apply NTE we considered two of these perturba- 
tions, firstly treatment with anti-CD3 and anti-CD28 to 
activate the T-cells and induce flux through the network, 
secondly treatment with anti-CD3, anti-CD28 and psitec- 
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FIG. 3: Matrix displaying differential NTEs computed for every vertex pair over the displayed network, positive values (green) 
correspond to NTEs higher in the network perturbed with anti-CD3 and anti-CD28, whilst negative values (red) are higher in 
the network also perturbed by psitectorigenin, a PIP2 inhibitor. 



torigenin, a reagent which specifically inactivates PIP2. 
We computed the NTE between all vertex pairs in the 
network over paths of length 1 — 5 for the two perturba- 
tions as described in Appendix D (see Figure 5 in Ap- 
pendix D, for matrices of NTEs for each perturbation). 
We found that in the psitectorigenin treated network in- 
formation transfer from PIP2 to the rest of the network 
was reduced over all path lengths (Figure 3) . Specifically, 
information transfer from PIP2 to PIP3 was greatly re- 
duced and information transfer from PIP2 to Plcg was 
reduced over paths of maximal length greater than one 
(implying PIP3 received less information from Plcg via 
PIP2 under psitectorigenin treatment). At longer path 
lengths we also see a reduced information transfer from 
PIP2 to Akt and p38 in the psitectorigenin treated net- 
work. This indicates that specific inhibition of PIP2 can 
lead to decreased Akt and p38 activation downstream of 
PIP2 signalling. 

Interestingly, we also notice that in the PIP2 inhibited 
network, there is increased information transfer from 
PKA to Akt and from PKC to p38. This points at a 
compensatory mechanism, in which inhibition of PIP2 
leading to reduced Akt and p38 activation is compen- 
sated for by PKC dependant p38 activation and PKA 
dependant Akt activation. 



Thus our NTE measure is capable of providing novel in- 
sights into signalling mechanisms in biological networks. 



IV. A GENERAL FRAMEWORK 

In defining our NTE measure we have additionally con- 
structed a family of mappings from the space of weighted 
networks to a family of metric spaces, in which elements 
of the metric spaces correspond to signal dynamics on 
the networks. The mappings and structure of the metric 
spaces are parameterised by the path length parameter 
n, the ISD Xq and the topology (i.e., the zero pattern, 
but not the edge weights) of the network and their con- 
struction is explained in detail in Appendix B. 
This formalism allows a more theoretical treatment of 
dynamics on networks from the perspective of metric 
spaces, and permits a coupling between weighted net- 
work structure and dynamics. In certain fields, under- 
standing the reaction of network dynamics to perturba- 
tions of edge weights is of great importance. This is par- 
ticularly true of network drug design 23 , in which one 
is interested in sequentially deforming the quantitative 
strengths of interactions in a pathological signalling net- 
work (via drugs) into those of a healthy network, with the 
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aim of establishing a healthy gene expression dynamic 
and mitigating the pathology. If this notion of treatment 
is logical within our framework, then one would postu- 
late that convergence in weight distribution of a sequence 
of networks to a limit distribution (in a matrix metric 
space) would imply convergence of the corresponding se- 
quence of dynamics to the dynamics of the limit network 
(in the network dynamic metric space). We state and 
prove a theorem in Appendix E which establishes this 
postulate as true. This result demonstrates an intuitive 
coupling between network structure and dynamics within 
our framework. 

Further theoretical questions may consider which ISDs 
are maintained under different networks, these represent 
persistent (attractor) states of the network information 
distribution. To identify such states we note that every 
graph (if we permit self edges at every vertex to repre- 
sent a non-zero probability of signal maintenance) admits 
a disjoint vertex cycle decomposition 24 . Thus there is 
always a way of sending signal around the network, with- 
out combining signal from two vertices at any one vertex. 
This implies that for every weighted network W, with self 
edges, there must exist a permutation matrix cj>, which 
admits at least one vector x satisfying <f>x = x {e.g., the 
vector (1, . . . , 1) T ), such that P[Xi = 4>x\X = x] > 0. 



The state x thus has a non-zero probability of being a 
persistent information state of the network. 
Questions concerning the evolution of self-assembling 
networks can also be considered in our framework via an 
application of dynamic programming. An introduction 
to this approach is detailed in Appendix F. 



V. CONCLUSION 

We have derived a general information theoretic mea- 
sure, network transfer entropy, for quantifying the 
amount of information transferred between any two ver- 
tices of a weighted network over paths of varying length. 
We have demonstrated our measure on simple synthetic 
weighted networks and applied it to biological signal 
transduction, revealing insights into the robustness of ki- 
nase signalling. We have also constructed a general met- 
ric space framework for dynamics on weighted networks 
and proved a convergence principle relating weighted net- 
work structure to dynamics. We outlined how problems 
in network evolution and network dynamic stability can 
be approached within our framework, formalisation of 
these approaches is a topic of future work. 
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Appendix A: Introduction 

In this appendix we provide certain mathematical 
derivations and technical methodologies to accompany 
the main text. In Appendix B we derive a closed form 
expression for the probability distribution P[AT^|A ], de- 
scribing the signal at a vertex i in a weighted network, 
given an initial signal distribution (ISD), Xq, has tra- 
versed a path of length n. Following the derivation of 
this expression we will explain how the network transfer 
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entropy (NTE) framework leads to the construction of a 
family of mappings from the space of weighted networks 
to a family of metric spaces describing signal dynam- 
ics. In Appendix C we explain how the NTE may be 
estimated from a simulation of the Markovian dynamic 
model introduced in the main text and how error may 
be compensated for in this estimation. In Appendix D 
we explain in detail the application of NTE to biological 
signal transduction, specifically the assignment of an ISD 
and edge weights for each perturbation. In Appendix E 
we state and prove aa convergence theorem in the metric 
space framework. Finally, in Appendix F, we outline an 
approach to network evolution from the perspective of 
dynamic programming. 



Appendix B: PpQ|Xo] and metric space 

To compute the probability distribution P[X£jXo] for 
a given weighted network W = (tuy)^ =1 , with corre- 
sponding stochastic matrix P = (pij)^j =1 (see main text) 
we first note that 



P[Xl = y\X ] 



P[X n = x\X }5l, 



(Bl) 



Where 5f denotes the Kronecker delta of i and j. 

In addition by the Markovian nature of our dynamic 

model 



P[X n = x\X ] = 



E 

,...,x„ 



P[X n = x\X n _ l ]...P[X l \X 



(B2) 



Reducing our problem to the calculation of the transition 
probabilities P[Xk-i\Xk], between states and the states 
themselves which must be summed over. These are not, 
however, immediate. For the calculation consider the 
following: given we know the full signal distribution at 
time-point k G N i.e. X k — X],, then all possible states 
of signal distribution at time-point k + 1 have the form 



X 



fc+l 



A T x k 



(B3) 



Here A = (A i:j ) 



N 



:1 , Aij G {0, 1} is a binary matrix with 
a single non-zero entry in every row; the column index j 
of the non-zero entry in row i corresponds to the unique 
vertex j that i has sent its signal to during the time-step 
k k + 1. We note that in addition A^ — if j $ Afi 
and that A is independent of x k . 

Thus every realisation of a single signal transfer event in 



a given weighted network can be represented as a matrix 
operation A, independently of ISD. We denote the set of 
such matrices by A, and emphasise that it depends only 
upon the topology of the weighted network. 
It is clear that for N < oo the set A must be countable, 
and its cardinality must be \A\ — YiiLi where fe, = 
\Ni\. Moreover, it is clear that we can construct every 
element in A given uf =1 Ni. 

Following this, it is clear that given any ISD Xq the signal 
distribution at time point k > must have the (possibly 
non-unique) form 



X k = At 



AjX , 



(B4) 



where A, G A for i = 1, . . . , k. Whence Eq. (B2 1 can be 
expressed as 



P[X n = x\X ] = Yl = A*n-i = A n-i ■ ■ ■ ATXo] ■ ■ ■ P[Xi = AfX \X ]. 



(B5) 
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Thus to compute P[X l n |A ], it suffices to compute 
P[X k+1 = A T k+1 . . . AfA |A fe =A%... AJX q ] , 

which is simply the probability of the signal dynamic 
Ak+i being selected from A. By model construction this 
can be expressed as 



Whence combining this with (BI I and (B5 ) we derive the 
closed form expression 



N N 

IIEm 

i=l j=l 



n N N 

P[K = y\x )= £ ^ r ...^ 0)i nnE^(^^- ( B6 ) 

A u ...,A„eA " r=lk=lj=l 



1. Metric Space 

We demonstrated above how, for a specific network W, 
and fSD Xq we can calculate a set of matrices describing 
possible signal dynamics over a single time-step of our 
model, as well as a probability distribution describing 
the signal on the entire network after n time steps. We 
will denote these constructs for the weighted network W 
by Aw and Pw[X„|Ao], respectively and stress that the 
former only depends on the topology of W and not the 
edge weights, we will denote the topology of W by t(W) 
(topology in this context refers to the zero pattern of the 
network and is independent of the edge weights). The 
probability distribution f\y[X n |Xo] is a measure over the 
finite set 

{A*...AZx :(A i )? =1 cA w }, 
which we will denote Q% (t(W)). If we denote 

Afl 

the space of probability measures over iYt (t(W)) by 
Mf(iV% (t(W))), then it is clear that for any two 
weighted networks W\ and W% with the same topol- 
ogy, T, the probability distributions Pn^ [A„|Ao] an d 
P W2 [X n \X ] are elements of M^fiJ (T)). 
It has been shown that for any measurable space f2, the 
square root of the JSD induces a metric on the space 
M 1 + (fi) [5U], thus the quantity 

\J J SD(P Wl [X n \X ] , P W2 [X n \Xq\) 

computes a metric distance between the probability dis- 
tributions describing the dynamics on W\ and Wi- 
Thus our NTE framework results in a mapping from the 
space of weighted networks to a family of metric spaces 
in which elements of the metric space represent possible 
signal dynamics. 

Appendix C: Estimating NTE 

Network transfer entropy is formulated as the JSD be- 
tween two probability distributions. As explored above 
we can derive closed form expressions for these probabil- 
ity distributions, however, their evaluation can be com- 
putationally expensive, if there are multiple vertices of 
large degree. This is because a main step in the evalua- 
tion of the expressions is constructing the set A of pos- 
sible signal dynamics over a single time step, which for 
a network on N vertices is of dimension YiiLi ki- More- 
over, the time complexity of evaluation scales exponen- 
tially with the path length parameter n. 



I 

For most networks, however, estimation of the probabil- 
ity distributions involved in the NTE expression can be 
done efficiently. As the model underlying these distri- 
butions is a discrete time Markov chain, with a discrete 
state space, we can employ Monte Carlo simulation for 
any ISD to provide realisations of the signal distribution 
on the entire network, for any path length parameter 
n. From these realisations the probability of a specified 
signal level at vertex i, given an ISD and path length 
parameter n, can be estimated as the proportion of sim- 
ulations in which the specified level is achieved. 
Two major considerations need to be addressed to en- 
sure accurate estimation from this procedure. Firstly, 
it is clear that the more simulations of the model per- 
formed, the more accurate the estimate of the probabil- 
ity distribution, moreover the estimate computed from 
K simulations will converge to the true distribution as 
K — > oo. Thus it is essential to select K sufficiently 
large to ensure that the estimated distribution is suffi- 
ciently near the true distribution with high probability. 
Secondly, given a specified K it is important to establish 
how the error in estimating the probability distributions 
translates to error in estimating the NTE. 
To address the first issue, we consider only the full net- 
work i.e., without any vertices set to absorbing state, as 
the stochastic matrix for the full network will have the 
fewest deterministic vertices, and thus will be the hard- 
est to estimate probability distributions for. For each 
probability to be estimated we construct a trace plot de- 
scribing the change of the estimate with the number of 
simulations K. This plot allows us to assess convergence 
of the estimate as K is increased. We select the number 
of simulations K for each network as the maximal K such 
that, the shape of the trace plots indicates convergence 
and the estimates (for every vertex signal probability) at 
K and K — 100 differ by no more than 0.01. 
To address the second issue of error in the NTE after 
selecting K, we computed multiple (R) estimations of 
the signal probability distributions for the full network 
from K simulations. We then computed, for every (^) 
estimate pair, the JSD between the two estimates of the 
signal distributions at each vertex. This JSD, computed 
for vertex i, can be interpreted as the NTE from a vertex 
j to i, when j sends no information to i, if the estimation 
is perfect, this quantity should be zero. As the estimation 
is imperfect we obtain (^) estimates of the error in the 
NTE, deriving from error in the estimation of the prob- 
ability distributions from simulation, for each receiving 
vertex. From these estimates we can estimate the first 
two moments of the error distribution, a NTE for a given 
receiving vertex is considered not attributable to error, 
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provided it lies at least two standard deviations above 
the maximal error observed for the vertex. 



Appendix D: Computing the ISD and edge weights 
for the biological network 

To compute the NTEs over the two perturbations of 
the biological signalling network considered in the main 
text we must first define the ISD and the edge weights 
from the data. As the kinases in the network must 
be phosphorylated to phosphorylate their direct targets, 
two connected proteins with highly positively correlated 
phosphorylation levels across single cell observations un- 
der a given perturbation, are likely interacting. Thus 
a suitable edge weight which captures the strength of a 
phosphorylation interaction represented by an edge 
under a given perturbation k is 1 + C^, where C^- is 
the Pearson correlation of phosphorylated protein levels 
across single cell measurements under perturbation k. 
Defining the ISD is less trivial and requires consideration 
of the question we wish to answer and some technicali- 
ties. To determine the differences in information transfer 
between the two perturbations, it makes sense to consider 
an ISD which quantifies the difference in phosphorylated 
protein levels between the two treatments. A technical 
issue to consider is how different signal distributions on 
the same network lead to different NTE values between 
vertices. It is clear that weighting a vertex with a non- 
zero signal leads to a higher NTE value between that 
vertex and its downstream interaction partners (depend- 
ing on the value of n) than weighting the vertex with 
zero signal. Thus vertices with an information deficit in 
one perturbation versus another should be weighted with 
zero signal in that perturbation and a non-zero signal in 
the other. 

A more subtle issue concerns the number of unique signal 
values attainable at each vertex for a given ISD and how 
this relates to the NTE. It is somewhat intuitive that if 
the inputs to a given vertex each have a unique initial 
signal value, then the range of values attainable by the 
receiving vertex will be more diverse than if all the inputs 
had the same initial signal value. Thus one may hypoth- 
esise that the NTE from one vertex to another vertex, 
with multiple inputs, will be larger if the input vertices 
have unique initial signal values than if they have iden- 
tical signal values. 

To explain this concept, consider the network shown in 
Figure 4, consisting of one central vertex with 3 possi- 
ble inputs, which are each as likely to forward signal to 
central vertex as they are to forward signal to a separate 
independent neighbour. We consider the effect of the ISD 
on the NTE from the circled vertex to the central vertex 
for path length parameter n = 1 via two cases. In the 
first scenario the initial signal at every input vertex is 
identical (in this example this signal value is one), while 
the signal at every other vertex starts at zero. In the sec- 
ond case every input vertex is given a unique value (in 



our example these values are simply 1, 2 and 3), while the 
other vertices are again given initial signals of 0. Con- 
sider the probability distribution of signal at the central 
vertex after a single signal transfer event. For the first 
ISD, this distribution can take 4 unique values (namely 
0, 1, 2 and 3), however for the second ISD, with unique 
signal values at the inputs, the distribution can take 7 
values (integers through 6). Now consider the signal 
distribution at the central vertex given that we prevent 
the circled vertex from sending information. For the first 
ISD, this distribution now admits only 3 possible values 
(0, I and 2), while for the second ISD only 4 possible val- 
ues are now attainable (0, 2, 3 and 5). Thus the size of 
the "coding alphabet" at the central vertex after removal 
of input from the circled vertex has shrunk from 4 to 3 
under the first ISD but from 7 to 4 under the second ISD 
(a much greater fall). Consequentially, we notice that 
the NTE from the circled vertex to the central vertex is 
lower for the first ISD than the second ISD. 
If we follow this concept that inputs with unique initial 
signal send more information to their outputs, it is log- 
ical that vertices with significantly higher signal in one 
perturbation versus another should be given a unique ini- 
tial signal value in the first perturbation, reflecting their 
capacity to send more information about the network. 
All that remains now is to consider how to assign initial 
signal to vertices which do not display a great difference 
in signal distribution across the two perturbations. One 
solution to this problem is to assign all these vertices 
an identical initial signal, in this way they can trans- 
fer more information than vertices with a signal deficit 
in one perturbation versus another but less information 
than vertices with a signal surplus. 

Guided by these concepts we constructed the ISD for each 
perturbation as follows. We utilised the limma package 
in R 25 to compute t-values testing, for each vertex 
in the network, the hypothesis that the phosphorylated 
protein level of the vertex was significantly different in 
the two treatments. If for a given vertex the phosphory- 
lated protein levels were significantly lower (p < 0.05) in 
one perturbation versus another it was assigned an ini- 
tial value of zero in that perturbation and a unique initial 
value (here chosen as the absolute t-statistic of the test) 
in the other perturbation. All vertices which did not dis- 
play significant changes between the two perturbations 
were assigned the same non-unique initial value of I in 
both perturbations. 

The ISDs and edge weights for the two perturbations 
alongside NTE matrices computed for each perturbation 
are provided in Figure 5. 



Appendix E: Proof of the convergence principle 

In this appendix we prove the following theorem 

Theorem 1 (Convergence Principle) Let {W n ) ne ^ 
be a sequence of weighted networks on N vertices of fixed 
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Signal=0 

Signal=1 

Signal=2 

Signal=3 

All edges equally 
weighted 



ISD 


• 

A. 




NTE from circled 
vertex to central 
vertex for n=1 


0.065 


0.162 



FIG. 4: Comparing the NTE from source vertices to a target vertex when the ISD at source vertices are unique (right) and 
non-unique (left), note that a non-unique ISD at input vertices results in a higher NTE from source to target. 



topology, T, and let (P m ) me n C [0,l] NxN be the corre- 
sponding row normalised stochastic matrices for the se- 
quence. Let P C [0,l] NxN be a stochastic matrix of 
topology T. If P m P in LP{M NxN ), p > I, as 
m — > oo ; then for fixed ISD X and path length parameter 
n, the signal distributions 

P pm [X n \X ] ^P P [X n \X ] 

asm->oo in the metric space 

(M+{il%(T)),y/JSD(.,.)). 



A well-known and easy to derive bound on V spaces, 
which holds for any A G M NxN is WA]^ < \ \A\\ p . 
Fix e > 0. As P m -> P in LP(M NxN ), we have that: 
there exists MeM such that for all m > M, 



IP" 



P\ 



< e. 



Let us define the matrix A e ( — 1, l) yvx " via 



A P . = pM _ R 



1. Proof 

Firstly we define the LP norm of a matrix A G M JVxAr 

i/p 



\A\ 




if p < oo 
if p = oo 



(El) 



it is clear that ||A ||<x> < e - 

We now consider for a fixed ISD Xq and path length 
parameter n the distributions Pp[X n = x\Xq\ and 
PpM [X n = ^|Xo], which we will hereafter refer to as 
Pp(x) and PpM (f). It was shown above that 



>(*) = E 



Ai,...,A n eA 



S (AT...A%X ) 



n N N 
r— 1 2—1 j—1 



(E2) 



The set of possible signal dynamics A for a given 
weighted network was also explicitly constructed above 
and was shown to depend only on the network topology 
and not on the edge weights of the network. Consequen- 
tially as the sequence (P™)„ 6 n and the network P have 



the same topology T, the set of possible signal dynamics 
A is the same for every element of the sequence and the 
network P. 

Consider expanding the product 
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n JV JV 
r— 1 i — 1 j — 1 



n AT AT 

r— 1 z— 1 j — 1 

iV JV TV 

^(p^CAOy - E • • • (E«-(^)^- - E A £;)(^) 

2=1 2=1 2=1 2=1 

JV JV JV JV 

£(P#(AOy - E A y)(A»)y) ■ ■ • (E( p ^'( A «)^ - E *Nj)(A»)Nj) • (E4) 

2=1 2=1 2=1 2=1 



Grouping together terms we can express the product as 



n JV JV n N JV 

HUE 7, • 1 * - ihie^o* 

r— 1 i— 1 j — 1 r— 1 -i— 1 j — 1 

' N n N n N N 

E E ( E A S(^-)«) II II ( E p sK A i) 

. i-1 r-1 j — 1 l^r s^i j — 1 



+ o(e). 



r 



(E5) 



We will denote the second term in the above expression 
by 



N n JV n JV JV 

EE(E^(^)nn(E^(^) 



(E6) 



Substitution of ( E5 1 into ( E2 ) yields 



•(*) = E 6 



A 1: ...,A n GA 



(Af...ATX ) 



JV JV 



II II E ^ (A0« - H {Ai) ^ (A p ) + o(e) 

r—1 4—1 j—1 



(E7) 



Clearly from ( E2 ) the first term can be expressed 



JV JV 



Al,...,A„eA r=li=lj=l 



(E8) 



For the second term, notice that 
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E ^r...Ajxo)^(^)? = i( AF ) 

Ai,...,A„e^ 



AT n N 



* E ^f...^ 0) EE(Eii AP iu(^) 

Ai,...,A„e.4 i=l r=l j'=l 

n JV AT 

nn (E^).) 

/y^r s^z j — 1 

N n n N N 

< e E ^...a^o)EEIIII(E^(^) 



/ AT \ ™ 



(E9) 



where the second inequality follows from 1 1 A p | |oo < e and Let us define 



^2j = i{A r )ij — 1 by construction of the set A. The final 
inequality follows from the facts that J2f=i ^sj r (Ai) S j < 
1 and |.4| = Yl i=1 h. Given these bounds and Eq. (E7) 



it follows that 



m(x) := - (¥ p (x)+¥ p m(x)) 



/ N \ " 

» P (x) < F P M (x) + I Y[ h J nNe + o(e) . (E10) 



it follows from (|E10j) and ( |Ell| ) that 
P P (x) 2P P (x) 



An identical argument can be used exchanging P M and 
P, in which case the sign of H(Ai)v_ (A p ) in (E7) 
chan ges to positive, however the bound established in and 
(E9) bounds the modulus of 'H(A i )^_ 1 (^ P ) and thus will 
always be greatest than the largest negative or largest ] 
positive value of %{Ai)v-_ (A p ). Thus we obtain the sym- 
metric bound 



m(x) 



< 



2¥ P (x) - (n<=i nNe + o(e) 



2P p m (x) 



(E12) 



m ( x ) 2P p m (x) - (Uti nNe + o(e) 

/ N \ " 

p m (x) < Fp(x) + I Y[ h ] nTVe + (e). (Ell) Thus it follows that 



(E13) 



JSD< 



■ P ,r P M j 



< 



*^Pp(x)log 



pr- + PpM (X) lOg r — 

2P P (x) 



+ log 



2P p m (a;) 



2PpM (x) - ( n»=i fcj ) niVe + o(e) 



(E14) 



By algebra of limits, it is clear that as e — >• 
2P P (x) 



and 



2P pA f (x) 



2P 



pW-(n"i^) n ^+»w 



2Pp« (x) - (nti nN ^ + 
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Whence it follows that 

JSD(¥ P ,¥ pm ) -> 

as m — > oo and the theorem is true. 
We note that the theorem also holds if the topologies 
of the sequence of weighted networks and limit network 
are different from one another, the proof of this state- 
ment follows precisely as above, the only difference being 
the set A utilised is that induced by the complete graph 
topology. 

Appendix F: Network evolution and dynamic 
programming 

As mentioned in the main text the evolution of self- 
assembling networks can also be considered in our frame- 
work via an application of dynamic programming. To 
see this we consider the space A (explicitly constructed 
above) containing matrix representations of all possible 



single path length, signal forwarding choices, induced by 
the complete graph on K vertices. We note that for every 
weighted network, W, on N vertices, where N < K, the 
corresponding stochastic matrix P can be expressed as 

K N 

a convex combination of elements in A, P = X^=i Pj-^j 

where {Aj}f =l = A, J2j Pi = ^ Pj — 1 f° r au 3- ^ onc 
interprets the space A as a state space of possible choices 
of signal dynamics through the network and considers 
P = {pj}f=i as a policy, giving a probability distribution 
of selecting a given global signal dynamic from the state 
space, that has been obtained by some optimality crite- 
rion, then one has a dynamic programming framework for 
network dynamic evolution. We note that one can cal- 
culate the policy explicitly, as pi = P[X\ = ^4jAo|^o]- 
Thus we have information to guide to construction of an 
optimality criterion describing network evolution. Forms 
of such a criterion can be posited and parameterised for 
different systems and suitable parameter regimes can be 
reverse engineered from the policy solution p. 
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ISD-treated with only anti-CD3 
and anti-CD28 



Receiving vertex 




Signal=1 
Low signal 
Moderate signal 
High signal 
No signal 



Low interaction 
probability 
Moderate interaction 
probability 
High interaction 
"probability 




Sending vertex 



ISD-treated with anti-CD3, anti- 
CD28 and Psitectorigenin 



Receiving vertex 





Sending vertex 



FIG. 5: The left hand side of the figure shows the ISD and edge weights for the anti-CD3 and anti-CD28 treated network (top) 
and for the anti-CD3, anti-CD28 and psitectorigenin treated network (bottom). The right hand side displays matrices of NTEs 
computed between every vertex pair over path lengths n = 1 — 5, in the anti-CD3 and anti CD28 treated network (top) and 
the anti-CD3, anti-CD28 and psitectorigenin treated network (bottom) 



